Semantico-syntactic Tagging of Very Large Corpora: the Case of Restoration of Nodes on the Underlying Level

Authors

Eva Hajičová

Petr Sgall

Abstract

The Prague Dependency Treebank has been conceived of as a semi-automatic three-layer annotation system, in which the layers of morphemic and 'analytic' (surface-syntactic) tagging are followed by the layer of tectogrammatical tree structures. Two types of deletions are recognized: (i) those licensed by the grammatical properties of the given sentence, and (ii) those possible only if the preceding context exhibits certain specific properties. Within group (i), either the position itself in the sentence structure is determined, but its lexical setting is 'free' (as e.g. with a deleted subject in Czech as a pro-drop language), or both the position and its 'filler' are determined. Group (ii) reflects the typological differences between English and Czech; the rich morphemics of the latter is more favorable for deletions. Several steps of the tagging procedure are carried out automatically, but most parts of the restoration of deleted nodes still have to be done "manually". If nodes depending on the node being restored are also deleted, they are restored only if they function as arguments or obligatory adjuncts. The large set of annotated utterances will make it possible to check and amend the present results, also by applying statistical methods. Theoretical linguistics will be able to check its descriptive framework; the degree of automation of the procedure will then be raised, and the treebank will be useful for a wide range of tasks in language processing.

1. Introductory remarks

The large corpus built in the Institute of the Czech National Corpus (led by F. Čermák at the Faculty of Philosophy, Charles University, Prague) now comprises more than 100 million word occurrences from different kinds of texts.
A part of this corpus has been used as the basis of the Prague Dependency Treebank (PDT; see Hajič), the scenario of which is based on the conviction of the initiators of the project that the result of tagging is to be used both for empirical and theoretical linguistic research and for 'practical' applications, such as dictionary making or the building of different systems of natural language processing. The PDT is therefore conceived of as a semi-automatic three-layer annotation system (see Hajičová), in which the layers of morphemic and 'analytic' (surface-syntactic) tagging are followed by a third layer, viz. that of tectogrammatical tree structures (TGTSs in the sequel). The TGTSs are intended to represent the underlying syntactic structure of sentences, which would be appropriate as the input to semantic(-pragmatic) interpretation, since the irregularities of the shallow layers, including synonymy and ambiguity, are absent on this level (see Hajičová and the writings quoted there). This implies that in the TGTSs, nodes for cases of (surface) deletions should be added ('reconstructed').
The transition from morphemic and analytic to tectogrammatical annotations is divided into three steps. First, an automatic procedure changes the morphemic tags into the corresponding grammatemes (values of morphological categories: tense, number, etc.) whenever possible, combining every analytic word form into a single node. The label of this node contains the lexical lemma indexed with a string of grammatemes derived from endings and grammatical affixes, as well as from auxiliary verbs, articles, prepositions and conjunctions; the only exceptions are coordinating conjunctions, which retain their nodes as governors of the coordinated syntagm. Second, a manual step is devoted to specifying most of the syntactic relations (functors) and the more difficult cases of grammatemes (including those reflecting the topic-focus articulation of the sentence and the corresponding movements). Third, another automatic step takes care of specifications that can be carried out on the basis of the preceding step, i.e. after the syntactic functions of the lexical occurrences have been fully determined (cf. e.g. the values of the pronouns discussed under ex. (5) below).

By now, 100 000 sentences from the Czech National Corpus have obtained their analytic annotations, and we expect to get thousands of sentences annotated with their tectogrammatical representations (TRs) before the end of the year 2000. Hundreds of sentences (the 'large corpus', LC) have already been tectogrammatically tagged as for the main points, including the restoration of most of the deleted items. A more detailed annotation has been achieved, up to now, for about 100 sentences (the 'model corpus', MC).

2. Types of deletion

Our preliminary analysis of the Czech National Corpus indicates that the following types of deletions have to be recognized: (i) deletions licensed by the grammatical properties of sentence elements or sentence structure, (ii) deletions possible only if the preceding context (be it co-text or context of situation) exhibits certain specific properties.

Our subclassification of reconstructions of nodes can be compared with the kinds of 'silent' anaphora in the annotation scheme of the FrameNet project (Fillmore 1999); in FrameNet, it is especially the counterpart of case (b) in Section 3 below (that of a "zero morph") that has been elaborated in detail. We do not go so far e.g. in the analysis of deverbative nouns, i.e. we just exclude the Actor from the valency frame of an agentive noun, such as writer, instead of characterizing the suffix as filling this slot. Other participants of verbs and deverbative nouns are restored, even if their head itself has been deleted and has to be added. In the LC, we in principle do not restore any (deleted) complementations of nouns except for the case of the maximally productive deverbatives with the prototypical suffix -á/aní, -tí (e.g. čekání 'waiting' from the verb čekat 'to wait'; we distinguish between psaní 'a letter' and 'writing' as in Dostali jsme psaní 'We got a letter' and Psaní mu trvalo hodinu 'Writing took him an hour').

3. Grammatical identification of the deleted item

Within group (i), two situations may obtain:

(a) Only the position itself in the sentence structure is predetermined (i.e. a sentence element is subcategorized for this position), but its lexical setting is 'free'. This is e.g. the case given by the so-called pro-drop character of a language like Czech, where the position of the subject of a verb is 'given', but it may be filled in dependence on the context, cf. (1):

(1) Předseda vlády řekl, že předloží návrh na změnu volebního systému.
'The Prime Minister said that (he, i.e. the Prime Minister, the Government, or somebody else identifiable on the basis of the context) will submit a proposal on the change of the election system.'

Here also belong cases of the semantically obligatory but deletable complementations of verbs: e.g. the Cz. verb přijet 'to arrive' has as its obligatory complementations an Actor and a Directional "where-to" (the obligatoriness of the Directional complementation can be tested by a question test, see Panevová 1974; Sgall et al. 1986), which can be deleted on the surface, cf. (2); here the Directional (here or there) is deleted because the speaker assumes that the hearer will identify the referent easily.

(2) Vlak přijede v šest hodin.
'The train will arrive at six o'clock.'

A subject of a verb is also supplied if it fails to be expressed on the surface, Cz. being a pro-drop language. A node with a label containing the lemma of a personal pronoun (including the anaphoric 3rd person pronoun) is added, and its values of gender and number are specified according to the congruent form of the verb and to what has been understood from the intra- or intersentential context; the restored node also obtains a functor (ACT Actor, DIR-3 Directional 'where to'). In the following examples, the added values are inserted in square brackets; a restored node is always marked as deleted (elided) by the value ELID:

(3) [My.ANIM.PL.ACT.ELID] Byli jsme tam všichni.
'We all were there.'

(4) Marie a Jana [tam.DIR-3.ELID] přišly a [on.FEM.PL.ACT.ELID] posadily se na pohovku.
'Mary and Jane came [there] and [they] sat down on the sofa.'

(5) Děti rozbily okno, ale [on.FEM.PL.ACT.ELID] omluvily se.
'The children broke a window, but [they] apologized.'

The value of Number with 1st pers. pronouns will be supplied both in MC and in LC by the second phase of the automatic procedure, on the input of which the subject-verb agreement has been specified. With the 3rd pers. pronouns, in MC also the functor of the antecedent and its serial number in the word order will be marked as values of specific attributes (distinguishing whether the antecedent occurs in the given sentence or in its predecessor in the text). The supplied word is always placed to the left of its governing word (should more than one word be inserted into one and the same place, their order must conform to the systemic ordering, i.e., ACT followed by most of the free modifications, then ADDR, PAT and EFF in this order).

(b) Both the position and its 'filler' are predetermined; this situation might be described as the presence of a "zero morph" rather than deletability, especially in case the deletion is obligatory. An example of the function of a zero morph are the so-called General Participants:

(6) Ta kniha [Gen.ACT.ELID] byla už vydána dvakrát.
'The book has already been published twice.' (Actor)

(7) V neděli [Gen.PAT.ELID] obvykle peču.
'lit.: On Sundays (I) usually bake'

(8) Dědeček [Gen.ADDR.ELID] vypravuje pohádky.
'Grandfather tells fairytales'

Also the phenomena of 'control' belong here, see Hajičová, Panevová and Sgall.

4. Contextual identification of the deleted item

Group (ii) consists of deletions conditioned by the context, in which the item to be restored is determined by the context alone. This is a point where the typological differences between English and a language with rich inflection, such as Czech, are most clearly to be seen; the rich morphemics allows for deletions in many cases in which a deletion is impossible in the English text. To put it in an extreme way, in principle everything in any position can be deleted in Czech if it is identifiable on the basis of the context; this is not the case in English, cf. e.g. the deletion of the whole topic of the sentence in (9):

(9) (Potkal jsi včera Toma?) Potkal.
'lit.: (Did you meet Tom yesterday?) Met'.
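The bracketed labels of examples (3) to (8) and the placement rule for restored nodes stated in Section 3 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the PDT tooling: the `Node` class, its field names and the `place_restored` helper are hypothetical, and treating every functor other than ACT, ADDR, PAT and EFF as a free modification is a simplification (the text says only "most of the free modifications").

```python
from dataclasses import dataclass, field
from typing import List

# Systemic ordering as quoted in the text: ACT, free modifications, ADDR, PAT, EFF.
_INNER_RANK = {"ACT": 0, "ADDR": 2, "PAT": 3, "EFF": 4}
_FREE_MODIFICATION_RANK = 1  # assumption: any other functor, e.g. DIR-3


@dataclass
class Node:
    """Hypothetical stand-in for a tectogrammatical node."""
    lemma: str
    functor: str
    grammatemes: List[str] = field(default_factory=list)
    elided: bool = False  # True for nodes restored on the underlying level

    def label(self) -> str:
        # lemma, then grammatemes, then functor, then ELID if restored
        parts = [self.lemma, *self.grammatemes, self.functor]
        if self.elided:
            parts.append("ELID")
        return ".".join(parts)


def systemic_rank(node: Node) -> int:
    """Position of the node's functor in the systemic ordering."""
    return _INNER_RANK.get(node.functor, _FREE_MODIFICATION_RANK)


def place_restored(surface: List[str], governor_index: int,
                   restored: List[Node]) -> List[str]:
    """Insert restored nodes immediately to the left of the governing word,
    ordered among themselves by the systemic ordering."""
    ordered = sorted(restored, key=systemic_rank)
    labels = [f"[{node.label()}]" for node in ordered]
    return surface[:governor_index] + labels + surface[governor_index:]


if __name__ == "__main__":
    # Example (3): the restored 1st person plural Actor precedes the verb.
    we = Node("My", "ACT", ["ANIM", "PL"], elided=True)
    print(" ".join(place_restored(["Byli", "jsme", "tam", "všichni"], 0, [we])))
    # prints: [My.ANIM.PL.ACT.ELID] Byli jsme tam všichni
```

When several nodes are restored at the same place, for instance a constructed case (not taken from the paper) with both `on.FEM.PL.ACT.ELID` and `tam.DIR-3.ELID` before one verb, the sketch puts the Actor first and the Directional second, in line with the quoted ordering.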
Along with these rather specific cases (in which the verb in a typical context does require the Objective to be present also in the surface), two cases are characteristic of contextual deletion in Czech:

(a) The restored node (i.e. deleted in the surface) is a governor of a congruent adjective (which has the functor ExD in the manually prepared analytic trees):

(10) Přišli jen [ten.ACT.Plur.ELID] mladší.
'Only (the) younger [ones] came.'

(11) Našli jen [ten.PAT.Plur.ELID] modré.
'They only found (the) blue (ones).'

This does not concern those adjectival words with which we assume a substantival function as well: such pronouns as ten (to) 'this', některý 'some', cardinal numerals, superlatives, and of course the 'substantivized adjectives' (nemocný 'ill', raněný 'wounded', etc.):

(12) Zvolili tři.PAT z pěti místopředsedů.
'They elected three from the five vice-presidents'

(13) (Připravili večeři pro deset hostů.) Přišli jen

Publication date: 2000